Here I demonstrate how our work can be used to model the arterial input function (AIF) for PET imaging using count data. First, I will show how we arrange the data for these models. Then, I will demonstrate the application of each model to these data.
We’ll also load the model functions as well as some helper functions which we’ll use along the way.
source("model_functions.R")
source("helper_functions.R")
I have saved the prepared AIF data in the `Rawdata` folder. This dataset consists of 10 patients.
# load data
filenames = list.files(path = "../Rawdata", pattern = "\\.csv$") # names of the data files
data = tibble(
  patient = substr(filenames, 1, 5), # the first 5 characters of each filename
  count = map(filenames, ~read.csv(paste0("../Rawdata/", .)) %>% select(-X))
)
Conventionally, the input function is modelled after converting all the count data into radioactivity. We also estimate values of the whole-blood-to-plasma ratio (BPR) over time using blood and plasma radioactivity values, which are either modelled or interpolated from a few data points, and likewise the plasma parent fraction using measured values, which are either interpolated or modelled.
In this approach, we model the input function using the original count data, i.e. before converting the data to radioactivity. However, we still require estimates of the BPR and parent fraction over time. For this reason, the data must first be processed in the conventional manner, in radioactivity units, to derive estimates of the BPR and plasma parent fraction, as well as to assess the effects of dispersion correction.
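In other words, the conventional corrections amount to dividing whole-blood radioactivity by the BPR and multiplying by the parent fraction. A minimal sketch with hypothetical values (`blood_to_aif` is not one of the project's functions):

```r
# Hypothetical sketch of the conventional corrections (not a project function)
blood_to_aif <- function(blood, bpr, parentFraction) {
  plasma <- blood / bpr          # whole blood -> plasma via the BPR
  plasma * parentFraction        # plasma -> metabolite-corrected AIF
}

blood_to_aif(blood = 100, bpr = 1.25, parentFraction = 0.8)  # 64
```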
The data below therefore contain both the count data and processed data generated by converting the counts to radioactivity. The strategy of this model is to use the counts data so that we can model the data with a more appropriate error distribution; however, this requires first processing the data in the conventional manner, and then returning to the counts when fitting the model itself.
We’ll start out by looking at one of the individuals within the dataset. The data is as follows:
subject_1 = data$count[[1]] %>% as.data.frame()
datatable(subject_1)
Here I will plot the AIF calculated in the conventional manner, after transforming arterial blood measurements to their respective arterial plasma estimates using the blood-to-plasma ratio, and transforming arterial plasma values to their respective metabolite-corrected estimates using the plasma parent fraction.
ggplot(subject_1, aes(x = time, y = disp_aif)) +
  geom_point(shape = "diamond", size = 2) +
  labs(x = "Time", y = "AIF",
       title = "AIF over time for patient jdcs1") +
  theme_light() + # apply the theme first so the axis-title tweaks below are not overridden
  theme(axis.title.x = element_text(vjust = 0, size = 15),
        axis.title.y = element_text(vjust = 2, size = 15))
As we can see, the AIF rises sharply and then slowly descends. We call the time at which the AIF reaches its peak \(tmax\). For the AIF before the peak, we use simple linear regression. Our main interest is in the AIF after the peak, which we model both parametrically and non-parametrically. For this reason, we split the curve at \(tmax\) into an ascent (asc) and a descent (dsc).
# find tmax and split the curve
data = data %>%
  group_by(patient) %>%
  mutate(count = map(count, ~findtmax(.x)),      # tmax is the time of the maximum aif
         count_asc = map(count, ~slice_asc(.x)), # data before tmax
         count_dsc = map(count, ~slice_exp(.x))) # data after tmax, with time and t_G re-zeroed at tmax
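The helpers `findtmax()`, `slice_asc()` and `slice_exp()` are defined in helper_functions.R. As a rough sketch of what such helpers might do (hypothetical implementations; the real ones may differ):

```r
# Hypothetical sketches; the real helpers live in helper_functions.R
findtmax_sketch <- function(df) {
  df$tmax <- df$time[which.max(df$disp_aif)]  # time of the maximum AIF
  df
}
slice_asc_sketch <- function(df) df[df$time <= df$tmax[1], ]
slice_dsc_sketch <- function(df) {
  dsc <- df[df$time >= df$tmax[1], ]
  dsc$time <- dsc$time - dsc$tmax[1]  # re-zero time at the peak
  dsc
}

toy <- data.frame(time = 0:3, disp_aif = c(1, 5, 3, 2))
toy <- findtmax_sketch(toy)
slice_dsc_sketch(toy)$time  # times measured from tmax: 0, 1, 2
```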
For regression of count data, we use offsets in the model to account for known multiplicative factors: for instance, if a sample spends longer in the counter, more counts will be recorded, so the counting duration enters the model as an offset.
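As a toy illustration on simulated data (not the study model): in a Poisson regression, a counting duration `delta` enters the linear predictor as `offset = log(delta)` with a fixed coefficient of 1, so expected counts scale proportionally with `delta`.

```r
set.seed(1)
n     <- 100
delta <- runif(n, 1, 10)               # time each sample spends in the counter
x     <- seq_len(n)
mu    <- exp(3 - 0.03 * x)             # underlying count rate per unit time
y     <- rpois(n, lambda = mu * delta) # longer counting -> more counts

fit <- glm(y ~ x, family = poisson, offset = log(delta))
coef(fit)["x"]  # recovers a slope close to the true -0.03
```

Without the offset, samples counted for longer would be mistaken for samples with higher underlying activity.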
The AIF is calculated from the counts as

\[ disp\_aif = \frac{count \times e^{\frac{\log(2) \times t_G}{20.364}} \times 0.037 \times \left(0.07756609407608 \times e^{0.080728586340915 \times vol}\right) \times parentFraction}{bpr \times disp\_fct \times delta \times vol} \]

where 20.364 min is the half-life of carbon-11, so the first exponential term is the decay correction, and the volume-dependent term is the counter calibration. Inverting this relationship gives the offset used in the code:

offset = log(delta) + log(vol) + log(disp_fct) - log(2)/20.364*t_G - log(0.003) - 0.0807*vol - log(parentFraction) + log(bpr)

Note that \(0.037 \times 0.0776 \approx 0.003\), which is why the two calibration constants are folded into the single \(-\log(0.003)\) term.
This is a bit complicated to follow. Let’s break that down:
pare_data = data %>%
  group_by(patient) %>%
  mutate(asc_res = map(count_asc, ~acs_inter(.x)), # detect t0 and interpolate the aif between t0 and tmax
         count_asc = map(asc_res, ~.x$data),       # add t0 to the data
         asc_mod = map(asc_res, ~.x$segmod),       # save the model that detects t0
         dsc_mod = map(count_dsc,                  # fit nonlinear poisson regression to the descending part
                       ~kinfitr_gnm(t = .x$time,   # time since tmax
                                    t_G = .x$t_G,  # time placed in the gamma counter, measured from injection
                                    y.sum = .x$count,
                                    delta = .x$delta, # time in the gamma counter
                                    vol = .x$vol,
                                    pf = .x$parentFraction,
                                    bpr = .x$bpr,
                                    disp = rep(1, nrow(.x))
                       )
         ),
         dsc_mod = map(dsc_mod, ~.x$result), # save the fitted model for the data after tmax
         asc_pred = map(asc_res, ~.x$pred),  # predictions before tmax
         # predictions after tmax, including the interpolated aif
         dsc_pred = map2(dsc_mod, count_dsc, ~pred_aif(.x, .y))
  ) %>%
  select(-asc_res)
for (i in 1:nrow(pare_data)) {
  patient = pare_data$patient[i]
  data_line = pare_data$count_dsc[[i]] %>% as.data.frame()
  para_data = pare_data$dsc_pred[[i]]$rsd %>% as.data.frame()

  # plot of continuous data
  con_data = data_line %>% filter(Method == "Continuous")
  plot(con_data$time, log(con_data$disp_aif), col = "grey",
       xlab = "time (min)",
       ylab = "log(AIF)",
       xlim = c(0, 90),
       ylim = c(-2, 5),
       main = paste0("Parametric Regression for patient: ", patient))
  legend(50, 5, "Continuous data", text.col = "grey", bty = "n")

  # add the discrete data to the same axes
  dis_data = data_line %>% filter(Method == "Discrete")
  points(dis_data$time, log(dis_data$disp_aif), col = "red")
  legend(50, 4.5, "Discrete data", text.col = "red", bty = "n")

  # add the parametric regression fit
  lines(para_data$time_dsc, log(para_data$pred), col = "darkorange2", lwd = 2)
  legend(50, 4, "Predicted value", text.col = "darkorange2", bty = "n")
}
For the non-parametric Poisson regression, we used the scam() function to fit monotone decreasing smooths. We also added a covariate to account for the discrepancy observed between the discrete and continuous data.
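To illustrate the shape constraint, here is a minimal sketch on simulated data (not the study model), using scam's monotone decreasing P-spline basis `bs = "mpd"`:

```r
library(scam)

set.seed(2)
d <- data.frame(time = seq(0.5, 60, length.out = 100))
d$y <- rpois(100, lambda = exp(3 - 0.05 * d$time))  # decaying Poisson counts

# monotone decreasing smooth of time ("mpd" = monotone decreasing P-spline)
fit <- scam(y ~ s(time, bs = "mpd"), family = poisson, data = d)

# by construction, the fitted curve never increases with time
all(diff(fitted(fit)) <= 1e-6)
```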
Below, we show the fit of the non-parametric Poisson regression.
Although the predicted values suggest that the Poisson regression model fits well, the Q-Q plot reveals a problem of over-dispersion.
Red lines: the reference line; grey bands: the reference bands.
To resolve the problem of over-dispersion, we replaced the Poisson distribution with a negative binomial distribution. Because the scam package does not support the negative binomial distribution, we also switched from scam() to gam() from the mgcv package. However, this means that our non-parametric model is no longer shape-constrained.
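As a sketch of why the family change matters (simulated data, not the study data), here are over-dispersed counts fitted with mgcv's gam() under a Poisson versus a negative binomial family:

```r
library(mgcv)

set.seed(3)
time <- seq(0.5, 90, length.out = 200)
mu   <- exp(4 - 0.04 * time)
y    <- rnbinom(200, size = 2, mu = mu)  # over-dispersed: var = mu + mu^2/2

fit_pois <- gam(y ~ s(time), family = poisson)
fit_nb   <- gam(y ~ s(time), family = nb())  # theta (dispersion) is estimated

AIC(fit_pois, fit_nb)  # the negative binomial model should be strongly preferred here
```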
### Resolving the problem of over-dispersion
Red lines: the reference line; grey bands: the reference bands.
While this model appears to have resolved the issue of over-dispersion, we were dissatisfied with the fits, which appeared to overfit, with upward deviations along what we would expect to be a practically monotonic descent.
To this end, we log-transformed time within the model in order to distribute the basis functions more evenly across the curve, with respect to both the number of measurements and the expected wiggliness (i.e. second derivative) of the curve. First, we added tmax back to each patient's data; then we log-transformed the time values and fitted the model. Below we show the regression model, the residual plot, and the quality of the fit.
These curves look much more like what we would expect! And the QQ plots, shown below, still look good.
| Patient | Tmax (min) | Number of data points |
|---|---|---|
| jdcs1 | 0.667 | 570 |
| jdcs2 | 0.717 | 567 |
| mhco1 | 0.950 | 553 |
| mhco2 | 1.067 | 546 |
| rwrd1 | 1.150 | 540 |
| rwrd2 | 1.083 | 544 |
| xehk1 | 1.350 | 528 |
| xehk2 | 1.000 | 549 |
| ytdh1 | 0.833 | 560 |
| ytdh2 | 0.683 | 569 |
Having resolved this issue of wiggliness, we wanted to improve our model further. In particular, since our model is no longer shape-constrained, another way to make the model more conservative is to fit all the individuals at once in a hierarchical model.
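One way such a hierarchical fit can be set up in mgcv (a sketch on simulated data; the column names and model formula here are assumptions, not the study's actual specification) is a shared smooth of log-time plus per-patient smooth deviations using the factor-smooth basis:

```r
library(mgcv)

set.seed(4)
d <- expand.grid(time = seq(0.5, 60, length.out = 60),
                 patient = factor(paste0("p", 1:5)))
slopes <- runif(5, 0.03, 0.06)  # each patient decays at a slightly different rate
d$y <- rnbinom(nrow(d), size = 3,
               mu = exp(4 - slopes[as.integer(d$patient)] * d$time))

# one shared smooth of log-time, plus per-patient smooth deviations
# (the "fs" factor-smooth basis acts like a random smooth per patient)
fit <- gam(y ~ s(log(time)) + s(log(time), patient, bs = "fs", m = 1),
           family = nb(), data = d)
```

Pooling all patients shrinks each individual curve towards the shared shape, which adds exactly the kind of conservatism an unconstrained smooth lacks.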
First, we found the mean tmax across all patients and added it back to each patient's data.